# Load a package, installing it from CRAN first if it is missing.
# quietly = TRUE suppresses the package startup messages.
loadPkg = function(x) { if (!require(x, character.only = TRUE, quietly = TRUE)) { install.packages(x, dependencies = TRUE, repos = "http://cran.us.r-project.org"); if (!require(x, character.only = TRUE)) stop("Package not found") } }
According to the World Health Organization (WHO), cardiovascular diseases (CVDs) are disorders of the heart and blood vessels. They are mainly caused by fatty deposits (plaque) that build up on the inner walls of the blood vessels and prevent blood from flowing to the heart or brain.
The process of fatty plaque formation.
According to a 2016 report, cardiovascular disease remains the leading cause of death in the United States (Benjamin et al., 2019). Around 80% of CVD deaths are due to heart attacks and strokes. Cardiovascular disease usually results from a combination of risk factors, such as unhealthy diet, obesity, physical inactivity, tobacco use, and harmful use of alcohol.
Body mass index (BMI) is calculated from a person’s weight and height as weight in kilograms divided by the square of height in meters (kg/m^2). It is used as an indicator of a person’s total body fat. Because it requires only weight and height, BMI is widely used in public health and clinical settings.
Many reports indicate that cardiovascular disease is associated with BMI and lifestyle. We therefore want to evaluate whether these factors truly correlate with the development of the disease.
The source data for our EDA is a CSV file containing 70,000 patient records with 12 features: age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, physical activity, and presence or absence of cardiovascular disease. (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset)
The variable ‘age’ is recorded as an integer number of days, so we converted it to integer years. Because height and weight individually say little about a patient’s health, we calculated Body Mass Index (BMI), a measure of body fat based on height and weight that applies to adult men and women, and added it as a feature. The column ‘id’ was also dropped.
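The preprocessing described above can be sketched in base R as follows. This is a minimal illustration on a few rows laid out like the CSV, not the report's actual code: the data frame name `raw`, the rounding rule for days to years, and the assumption that height is in centimetres and weight in kilograms are all ours.

```r
# Illustrative rows in the layout of the source CSV (only the columns needed here).
raw <- data.frame(
  id     = c(0, 1, 2),
  age    = c(18393, 20228, 18857),  # age in days
  height = c(168, 156, 165),        # assumed to be centimetres
  weight = c(62, 85, 64)            # assumed to be kilograms
)

clean <- raw
# Days -> whole years; rounding (rather than flooring) is an assumption.
clean$age <- as.integer(round(clean$age / 365.25))
# BMI = weight (kg) / height (m)^2, so centimetres are divided by 100 first.
clean$bmi <- clean$weight / (clean$height / 100)^2
# Drop the identifier column, which carries no information about health.
clean$id <- NULL

print(clean)
```

Converting height to metres before squaring is the step that is easiest to get wrong; skipping it shrinks every BMI value by a factor of 10,000.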
## age gender height weight ap_hi
## Min. :30.00 1:45530 Min. : 55.0 Min. : 10.00 Min. : -150.0
## 1st Qu.:48.00 2:24470 1st Qu.:159.0 1st Qu.: 65.00 1st Qu.: 120.0
## Median :54.00 Median :165.0 Median : 72.00 Median : 120.0
## Mean :53.34 Mean :164.4 Mean : 74.21 Mean : 128.8
## 3rd Qu.:58.00 3rd Qu.:170.0 3rd Qu.: 82.00 3rd Qu.: 140.0
## Max. :65.00 Max. :250.0 Max. :200.00 Max. :16020.0
## ap_lo cholesterol gluc smoke alco active
## Min. : -70.00 1:52385 1:59479 0:63831 0:66236 0:13739
## 1st Qu.: 80.00 2: 9549 2: 5190 1: 6169 1: 3764 1:56261
## Median : 80.00 3: 8066 3: 5331
## Mean : 96.63
## 3rd Qu.: 90.00
## Max. :11000.00
## cardio bmi
## 0:35021 Min. : 3.472
## 1:34979 1st Qu.: 23.875
## Median : 26.374
## Mean : 27.557
## 3rd Qu.: 30.222
## Max. :298.667
The minimum values of systolic blood pressure (ap_hi) and diastolic blood pressure (ap_lo) are negative, which is not physically meaningful. In addition, diastolic blood pressure should be lower than systolic blood pressure. The data were further cleaned based on these criteria.
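The two cleaning rules just stated (no negative pressures, diastolic below systolic) could be applied with a logical filter like the one below. It is a sketch on made-up rows; the data frame name `bp` is hypothetical, and the report's additional outlier thresholds are not reproduced because the text does not state them.

```r
# Made-up blood pressure rows: one valid, three violating a rule.
bp <- data.frame(
  ap_hi = c(120, -150, 110, 80),
  ap_lo = c(80,    90, -70, 120)
)

# Keep rows where both pressures are positive and diastolic < systolic.
valid    <- bp$ap_hi > 0 & bp$ap_lo > 0 & bp$ap_lo < bp$ap_hi
bp_clean <- bp[valid, , drop = FALSE]

print(bp_clean)
```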
Then the distributions of age, height, weight, ap_hi, and ap_lo were checked.
The histogram of age shows only a few observations with age<35, too few to represent that population, so these observations were dropped. For height, weight, ap_hi, and ap_lo, the histograms were heavily skewed by some extreme outliers, which were also dropped in this step.
The distributions of age, height, weight, ap_hi, and ap_lo were checked again after the outliers were removed.
A correlation matrix was plotted to get an idea of the correlations among the variables.
We noticed that bmi is positively correlated with both ap_hi and ap_lo; ap_hi is positively correlated with age; and ap_hi and ap_lo are, as expected, highly positively correlated.
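For readers who want the numbers behind such a plot, a correlation matrix can be computed with base R's `cor()` before it is passed to a plotting function. The sketch below uses simulated data, so the values it prints are not the report's; the column names merely mirror the dataset.

```r
set.seed(1)
n <- 200
ap_lo <- rnorm(n, mean = 80, sd = 10)

# Simulated data: ap_hi is built from ap_lo so the two are correlated.
toy <- data.frame(
  ap_lo = ap_lo,
  ap_hi = ap_lo + rnorm(n, mean = 45, sd = 8),
  bmi   = rnorm(n, mean = 27, sd = 4)
)

# Restrict to numeric columns, then compute Pearson correlations.
num_cols <- vapply(toy, is.numeric, logical(1))
corr     <- cor(toy[, num_cols])

print(round(corr, 2))
```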
What are the risk factors of cardiovascular disease? Are gender, BMI, cholesterol level, glucose level, smoking, alcohol over-consumption, and lack of exercise associated with the development of cardiovascular disease?
First, we bin BMI values into 5 groups of width 10, starting from the lowest range (1-10) and moving each subsequent range up by 10. We then use the chi-square test of independence, which is suited to categorical variables, to test the association between cardiovascular disease and BMI group, age group, cholesterol level, glucose level, smoking behavior, alcohol over-consumption, and lack of exercise.
##
## Pearson's Chi-squared test
##
## data: cardio_bmi
## X-squared = 1546.2, df = 4, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: cardio_glucose
## X-squared = 476.48, df = 2, p-value < 2.2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cardio_smoke
## X-squared = 30.402, df = 1, p-value = 3.511e-08
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cardio_alco
## X-squared = 9.4869, df = 1, p-value = 0.002069
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cardio_active
## X-squared = 88.188, df = 1, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: cardio_age
## X-squared = 2974.3, df = 3, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: cardio_cholesterol
## X-squared = 2976.9, df = 2, p-value < 2.2e-16
The null hypothesis of independence is rejected in every test, as all p-values are below 0.05. All the factors above are therefore associated with cardiovascular disease and are treated as candidate risk factors.
In addition to the tests in Chapter 2, a bar plot is generated to show the relationship between age and the onset of cardiovascular disease.
As the bar plot shows, more older people than younger people have cardiovascular disease. We will discuss how age may affect cardiovascular disease in further detail in Chapter 5.
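The tests above all follow the same pattern: build a contingency table of two categorical variables and pass it to `chisq.test()`. The sketch below shows that pattern on made-up counts (they are not the report's data); for a 2x2 table R applies Yates' continuity correction by default, which is why some of the outputs above mention it.

```r
# Made-up 2x2 table of smoking status against disease status.
tab <- matrix(
  c(300, 200,
    250, 250),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoke = c("0", "1"), cardio = c("0", "1"))
)

res <- chisq.test(tab)  # Yates' correction is on by default for 2x2 tables

print(res$method)
print(res$statistic)
print(res$parameter)    # degrees of freedom: (rows - 1) * (cols - 1)
print(res$p.value)
```

The degrees of freedom explain the `df` values in the outputs: a factor with five BMI groups against the two-level cardio variable gives (5 - 1) * (2 - 1) = 4.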
What is the relationship between BMI and cardiovascular disease, and what factors affect BMI?
A bar plot is generated to show the relationship between BMI group and the onset of cardiovascular disease.
By comparing the incidence of cardiovascular disease across BMI groups, we see that people with higher BMI are more likely to have cardiovascular disease. Likewise, people with cardiovascular disease are more likely to have a higher BMI.
In addition, we split the data by cardiovascular disease status, with [cardio0] for people without cardiovascular disease and [cardio1] for people with cardiovascular disease. Then we compare the mean BMI and the BMI histogram of the two groups.
## [1] 26.31278
## [1] 27.94742
From both the means and the histograms, we observe that people with cardiovascular disease tend to have a higher BMI (mean 27.95) than people without it (mean 26.31). We conclude that BMI and cardiovascular disease are associated.
We proceed with the hypothesis that risk factors for high BMI are also risk factors for cardiovascular disease.
We first use the chi-square test to examine the relationship between BMI group and glucose level, smoking, alcohol consumption, and physical activity.
##
## Pearson's Chi-squared test
##
## data: bmi_glucose
## X-squared = 637.17, df = 8, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: bmi_smoke
## X-squared = 124.54, df = 4, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: bmi_alco
## X-squared = 53.676, df = 4, p-value = 6.153e-11
##
## Pearson's Chi-squared test
##
## data: bmi_active
## X-squared = 8.5446, df = 4, p-value = 0.07355
The null hypothesis of independence is rejected for glucose level, smoking behavior, and alcohol consumption, as their p-values are far below 0.05. BMI group is therefore associated with each of these three factors.
Since BMI is itself associated with the presence of cardiovascular disease, this is consistent with glucose level, smoking behavior, and alcohol consumption also being associated with cardiovascular disease.
For activity level, we fail to reject H0 because the p-value (0.07355) is greater than 0.05. The test gives no evidence that the distribution of BMI groups differs between active and inactive people.
We use t-tests and boxplots to compare BMI across the levels of each risk factor. Specifically, we subset women and men for gender, not_smoke (0) and smoke (1) for smoking behavior, noalco (0) and alco (1) for alcohol consumption, and noactive (0) and active (1) for activity level. The boxplots help us find the groups with higher BMI.
##
## Welch Two Sample t-test
##
## data: cardio0$bmi and cardio1$bmi
## t = -45.412, df = 61629, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.705197 -1.564092
## sample estimates:
## mean of x mean of y
## 26.31278 27.94742
##
## Welch Two Sample t-test
##
## data: women$bmi and men$bmi
## t = 36.099, df = 55336, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.196799 1.334223
## sample estimates:
## mean of x mean of y
## 27.56122 26.29571
##
## Welch Two Sample t-test
##
## data: not_smoke$bmi and smoke$bmi
## t = 11.943, df = 6773.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5853395 0.8152236
## sample estimates:
## mean of x mean of y
## 27.18058 26.48030
##
## Welch Two Sample t-test
##
## data: noalco$bmi and alco$bmi
## t = -2.8869, df = 3647, p-value = 0.003914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.38919138 -0.07436675
## sample estimates:
## mean of x mean of y
## 27.10803 27.33981
##
## Welch Two Sample t-test
##
## data: noactive$bmi and active$bmi
## t = 2.1267, df = 18396, p-value = 0.03346
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.007756491 0.190271227
## sample estimates:
## mean of x mean of y
## 27.19976 27.10075
We conclude that not smoking, alcohol consumption, and female gender are associated with a higher mean BMI. Inactive people also have a slightly higher mean BMI than active people (27.20 vs 27.10, p = 0.033), although the difference is small. Because cardiovascular disease is associated with BMI, we hypothesize that not smoking, alcohol consumption, and female gender may also be risk factors for cardiovascular disease; the t-tests alone do not establish this.
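Each comparison above is a Welch two-sample t-test, which is what `t.test()` runs by default because it does not assume equal variances. A minimal sketch on simulated BMI values follows; the group names and numbers are invented and do not reproduce the report's results.

```r
set.seed(42)
# Simulated BMI values for two groups of different size and spread.
bmi_group_a <- rnorm(500, mean = 27.5, sd = 5)
bmi_group_b <- rnorm(300, mean = 26.3, sd = 4)

# var.equal = FALSE is the default, giving the Welch test.
tt <- t.test(bmi_group_a, bmi_group_b)

print(tt$method)
print(tt$estimate)   # the two sample means
print(tt$conf.int)   # 95% CI for the difference in means
print(tt$p.value)
```

With samples as large as those in the report, even a difference of about 0.1 BMI units can be statistically significant, so the confidence interval is more informative than the p-value alone.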
Are the mean values of factors such as systolic and diastolic blood pressure the same across age groups?
We showed earlier that age and cardiovascular disease are related. Here we discuss age and blood pressure.
H0: The mean values of systolic blood pressure are the same across all age groups.
H1: The mean values of systolic blood pressure are different across age groups.
ANOVA and TukeyHSD are used to test the hypothesis and calculate the p-values. The output below summarizes the results.
## Df Sum Sq Mean Sq F value Pr(>F)
## ageGroup 3 454301 151434 769.1 <2e-16 ***
## Residuals 62495 12305612 197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ap_hi ~ ageGroup, data = cardio)
##
## $ageGroup
## diff lwr upr p adj
## 45-54-35-44 4.0170952 3.569310 4.464880 0.0000000
## 55-64-35-44 7.7224200 7.281768 8.163072 0.0000000
## 65-74-35-44 7.5728352 5.527256 9.618414 0.0000000
## 55-64-45-54 3.7053248 3.392737 4.017912 0.0000000
## 65-74-45-54 3.5557400 1.533877 5.577603 0.0000370
## 65-74-55-64 -0.1495848 -2.169880 1.870710 0.9975614
Based on TukeyHSD, every pair of age groups has a significantly different mean systolic blood pressure except (65~74) and (55~64), whose difference is not significant (adjusted p = 0.998).
The ANOVA null hypothesis is rejected (p < 2e-16). We conclude that mean systolic blood pressure is not the same across all age groups.
H0: The mean values of diastolic blood pressure are the same across all age groups.
H1: The mean values of diastolic blood pressure are different across age groups.
ANOVA and TukeyHSD are used to test the hypothesis and calculate the p-values. The output below summarizes the results.
## Df Sum Sq Mean Sq F value Pr(>F)
## ageGroup 3 70521 23507 407.1 <2e-16 ***
## Residuals 62495 3608599 58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ap_lo ~ ageGroup, data = cardio)
##
## $ageGroup
## diff lwr upr p adj
## 45-54-35-44 1.8768690 1.63438260 2.1193553 0.0000000
## 55-64-35-44 3.1436368 2.90501313 3.3822604 0.0000000
## 65-74-35-44 3.0355532 1.92782310 4.1432833 0.0000000
## 55-64-45-54 1.2667678 1.09749421 1.4360414 0.0000000
## 65-74-45-54 1.1586842 0.06379689 2.2535715 0.0331826
## 65-74-55-64 -0.1080836 -1.20212193 0.9859547 0.9942677
Based on TukeyHSD, every pair of age groups has a significantly different mean diastolic blood pressure except (65~74) and (55~64), whose difference is not significant (adjusted p = 0.994).
The ANOVA null hypothesis is rejected (p < 2e-16). We conclude that mean diastolic blood pressure is not the same across all age groups.
We conclude that blood pressure differs by age group, rising with age up to the 55~64 group. High blood pressure increases the incidence of cardiovascular disease.
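The ANOVA followed by TukeyHSD workflow used in this chapter can be sketched as below. The data are simulated with two age groups given the same mean on purpose, to illustrate how a significant overall F-test can coexist with a non-significant pair; none of the numbers come from the report.

```r
set.seed(7)
groups <- c("35-44", "45-54", "55-64", "65-74")

# Simulated systolic pressure; the last two groups share the same true mean.
toy <- data.frame(
  ageGroup = factor(rep(groups, each = 100), levels = groups),
  ap_hi    = rnorm(400, mean = rep(c(120, 124, 128, 128), each = 100), sd = 14)
)

fit <- aov(ap_hi ~ ageGroup, data = toy)
print(summary(fit))          # overall F-test: are all group means equal?

tuk <- TukeyHSD(fit)         # pairwise comparisons with family-wise control
print(round(tuk$ageGroup, 4))
```

The ANOVA p-value answers only whether at least one mean differs; the `p adj` column of the Tukey table has to be read row by row to say which pairs differ.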
Being overweight or obese substantially increases the risk of developing cardiovascular disease. However, researchers don’t always agree on which method is best for quantifying whether an individual is “too” overweight. So, this section analyzes whether BMI is the best measurement for predicting risk.
Is BMI the best measurement for predicting risk in every individual? If not, is there another measurement that helps predict the risk of cardiovascular disease?
To find the adult weight classification, see which of these BMI ranges the value falls into:
| BMI | adult weight classification |
|---|---|
| [0, 18.5) kg/m^2 | underWeight |
| [18.5, 25) kg/m^2 | normalWeight |
| [25, 30) kg/m^2 | overWeight |
| [30, 35) kg/m^2 | obese |
| [35, 45) kg/m^2 | severelyObese |
| [45, 50) kg/m^2 | morbidlyObese |
| [50, Inf) kg/m^2 | superObese |
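The classification table above maps directly onto base R's `cut()`. The sketch below assumes the half-open intervals exactly as written in the table (left end included, right end excluded), which is what `right = FALSE` requests; the example BMI values are arbitrary.

```r
# Break points and labels copied from the classification table.
breaks <- c(0, 18.5, 25, 30, 35, 45, 50, Inf)
labels <- c("underWeight", "normalWeight", "overWeight", "obese",
            "severelyObese", "morbidlyObese", "superObese")

bmi <- c(17.2, 21.97, 25, 34.93, 47.5, 52)

# right = FALSE makes each interval [lower, upper), so 25 is overWeight.
bmi_class <- cut(bmi, breaks = breaks, labels = labels, right = FALSE)

print(data.frame(bmi, bmi_class))
```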
The bar chart presents the incidence of cardiovascular disease in groups with different BMI levels. Below the severelyObese level, the incidence of cardiovascular disease gradually increases as BMI increases. Above the severelyObese level, the incidence levels off and shows a slow downward trend. However, BMI is not accurate for every individual. It overestimates body fat in people with a lot of muscle mass and tends to underestimate it in elderly people. This motivates using waist circumference as a risk predictor.
Carrying excess body fat around your middle is more of a health risk than if weight is on your hips and thighs. In that case, waist circumference is a better estimate of visceral fat, the dangerous internal fat that coats the organs.
An initial model expressed the regression of WC on BMI in the following form:
$$\mathrm{WC}_i = b_0 + b_1\,\mathrm{BMI}_i + b_2\,\mathrm{AGE}_i + b_3\,\mathrm{BLACK}_i + b_4\,\mathrm{HISP}_i + e_i$$
where $i$ indexes individuals, $\mathrm{WC}_i$ is waist circumference for individual $i$, $\mathrm{BMI}_i$ is body mass index, $\mathrm{AGE}_i$ is current age (in years), $\mathrm{BLACK}_i$ is an indicator for African-American, $\mathrm{HISP}_i$ is an indicator for Hispanic ethnicity, and $e_i$ is the residual.
For women the pattern was better summarized by using one constant for age<35 years and a separate intercept and slope for age≥35 years. Thus, the model for women was
$$\mathrm{WC}_i = c_0 + c_1\,\mathrm{BMI}_i + c_2\,I\{\mathrm{AGE}_i \ge 35\} + c_3\,\mathrm{AGE}_i \times I\{\mathrm{AGE}_i \ge 35\} + c_4\,\mathrm{BLACK}_i + c_5\,\mathrm{HISP}_i + e_i$$
where $I\{B\}$ is an indicator function: $I\{B\} = 1$ when $B$ is true and 0 otherwise.
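To make the role of the indicator function concrete, the women's model can be written as a small R function. The function name `predict_wc_women` is hypothetical, and the coefficient values below are round placeholders chosen only so the arithmetic is easy to follow; they are not the fitted coefficients, which the text does not list.

```r
# Women's model: WC = c0 + c1*BMI + c2*I{AGE>=35} + c3*AGE*I{AGE>=35}
#                     + c4*BLACK + c5*HISP  (residual omitted for prediction)
predict_wc_women <- function(bmi, age, black, hisp, coefs) {
  over35 <- as.numeric(age >= 35)  # indicator I{AGE >= 35}
  coefs[["c0"]] +
    coefs[["c1"]] * bmi +
    coefs[["c2"]] * over35 +
    coefs[["c3"]] * age * over35 +
    coefs[["c4"]] * black +
    coefs[["c5"]] * hisp
}

# PLACEHOLDER coefficients, not the published or fitted values.
placeholder <- c(c0 = 30, c1 = 2, c2 = 1, c3 = 0.1, c4 = -1, c5 = -1)

wc_under35 <- predict_wc_women(bmi = 25, age = 30, black = 0, hisp = 0, coefs = placeholder)
wc_over35  <- predict_wc_women(bmi = 25, age = 40, black = 0, hisp = 0, coefs = placeholder)
print(c(wc_under35, wc_over35))
```

Below age 35 the age terms switch off entirely, so the prediction depends only on BMI and ethnicity; from 35 onward both the extra intercept and the age slope apply.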
After the prediction, the sample of the data is as follows:
## gender bmi predict_waist
## 1 2 21.96712 85.90547
## 2 1 34.92768 108.87168
## 3 1 23.50781 83.16440
## 4 2 28.71048 102.58695
## 6 1 29.38468 97.20714
Studies have shown that a waist circumference of 95cm or more in men, and of 88cm or more in women, is associated with elevated cardiovascular risk. So, we use these values as the cut off line for each gender.
In this analysis, a Body Mass Index of 25kg/m^2 or more is labeled obese (this combines the overWeight and higher classes in the classification above), which means the risk of having cardiovascular disease is higher.
At the same time, two categories are defined here: “safe area” and “warning area”. When waist circumference is below the cut off line and BMI is below the obese threshold, the result is safe area; otherwise the result is warning area.
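The safe/warning rule can be expressed as a vectorized function. The sketch below applies it to the five sample rows of predicted waist circumference shown earlier; the function name `classify_area` is hypothetical, and the gender coding (1 = women, 2 = men) is an assumption inferred from the group sizes in the outputs.

```r
classify_area <- function(gender, bmi, waist) {
  # Gender-specific waist cut off in cm; assumes 2 = men, 1 = women.
  cutoff <- ifelse(gender == 2, 95, 88)
  # Safe only when BOTH measures are below their thresholds.
  ifelse(bmi < 25 & waist < cutoff, "safe area", "warning area")
}

# The five sample rows printed after the prediction step.
sample_rows <- data.frame(
  gender        = c(2, 1, 1, 2, 1),
  bmi           = c(21.96712, 34.92768, 23.50781, 28.71048, 29.38468),
  predict_waist = c(85.90547, 108.87168, 83.16440, 102.58695, 97.20714)
)

sample_rows$bmi_waist <- classify_area(sample_rows$gender,
                                       sample_rows$bmi,
                                       sample_rows$predict_waist)
print(sample_rows)
```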
The statistical results of cardiovascular disease after cut off and obese classification are as follows:
| gender | obese | cut_off | bmi_waist | cardio | n |
|---|---|---|---|---|---|
| 1 | normal weight | below cut off | safe area | 0 | 8920 |
| 1 | normal weight | below cut off | safe area | 1 | 6057 |
| 1 | normal weight | over cut off | warning area | 1 | 2 |
| 1 | obese | below cut off | warning area | 0 | 1285 |
| 1 | obese | below cut off | warning area | 1 | 785 |
| 1 | obese | over cut off | warning area | 0 | 10329 |
| 1 | obese | over cut off | warning area | 1 | 13337 |
| 2 | normal weight | below cut off | safe area | 0 | 5383 |
| 2 | normal weight | below cut off | safe area | 1 | 3579 |
| 2 | normal weight | over cut off | warning area | 0 | 47 |
| 2 | normal weight | over cut off | warning area | 1 | 83 |
| 2 | obese | below cut off | warning area | 0 | 648 |
| 2 | obese | below cut off | warning area | 1 | 376 |
| 2 | obese | over cut off | warning area | 0 | 5019 |
| 2 | obese | over cut off | warning area | 1 | 6649 |
From the three obesity vs cardiovascular disease bar plots, 15.6% of all participants are normal weight with cardiovascular disease, and 33.8% are obese with cardiovascular disease. Among women, the shares are 14.9% (normal weight) and 34.7% (obese); among men, 16.8% and 32.2%. These percentages are shares of each whole group, not risks within a weight class. From the counts in the table, the risk of cardiovascular disease within a weight class is about 40% for normal weight (9,721 of 24,071) and about 55% for obese (21,147 of 38,428).
From the three cut off line vs cardiovascular disease bar plots, 17.3% of all participants are below the cut off line with cardiovascular disease, and 32.1% are over the cut off line with cardiovascular disease. Among women, the shares are 16.8% (below) and 32.8% (over); among men, 18.2% and 30.9%. Within each class, the risk is about 40% below the cut off line (10,797 of 27,033) and about 57% over it (20,071 of 35,466).
From the three safe/warning area vs cardiovascular disease bar plots, 15.4% of all participants are in the safe area with cardiovascular disease, and 34% are in the warning area with cardiovascular disease. Among women, the shares are 14.9% (safe area) and 34.7% (warning area); among men, 16.4% and 32.6%. Within each class, the risk is about 40% in the safe area (9,636 of 23,939) and about 55% in the warning area (21,232 of 38,560).
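The counts in the table above support two different percentages, and it is easy to mix them up: the share of the whole sample that is in a class and has the disease, and the risk of disease within a class. The sketch below recomputes both for the safe/warning classification from the table's own counts; the object names are ours.

```r
# Counts copied from the table (gender, bmi_waist class, cardio, n).
counts <- data.frame(
  gender    = c(1, 1, 1, 1, 1, 1, 1,  2, 2, 2, 2, 2, 2, 2, 2),
  bmi_waist = c("safe", "safe", "warning", "warning", "warning", "warning", "warning",
                "safe", "safe", "warning", "warning", "warning", "warning", "warning", "warning"),
  cardio    = c(0, 1, 1, 0, 1, 0, 1,  0, 1, 0, 1, 0, 1, 0, 1),
  n         = c(8920, 6057, 2, 1285, 785, 10329, 13337,
                5383, 3579, 47, 83, 648, 376, 5019, 6649)
)

total   <- sum(counts$n)
cases   <- tapply(counts$n * counts$cardio, counts$bmi_waist, sum)  # diseased per class
group_n <- tapply(counts$n, counts$bmi_waist, sum)                  # class sizes

share_of_sample   <- cases / total    # P(class and disease)
risk_within_class <- cases / group_n  # P(disease | class)

print(round(rbind(share_of_sample, risk_within_class), 3))
```

The first row reproduces the bar plot percentages; the second row is the quantity to use when asking how likely disease is for a person in a given class.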
What are the risk factors of cardiovascular diseases? Are all the variables correlated to the development of cardiovascular disease?
##
## Call:
## glm(formula = cardio ~ gender + age + ap_hi + ap_lo + cholesterol +
## bmi + gluc + smoke + alco + active, family = "binomial",
## data = cardio)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0174 -0.9166 -0.3889 0.9299 2.5614
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.512389 0.141718 -88.291 < 2e-16 ***
## gender2 0.019639 0.020592 0.954 0.340
## age 0.050466 0.001418 35.596 < 2e-16 ***
## ap_hi 0.062213 0.001047 59.435 < 2e-16 ***
## ap_lo 0.015256 0.001765 8.645 < 2e-16 ***
## cholesterol2 0.361843 0.028912 12.515 < 2e-16 ***
## cholesterol3 1.088438 0.037622 28.931 < 2e-16 ***
## bmi 0.029144 0.002137 13.638 < 2e-16 ***
## gluc2 0.007199 0.038427 0.187 0.851
## gluc3 -0.319649 0.041611 -7.682 1.57e-14 ***
## smoke1 -0.158285 0.036725 -4.310 1.63e-05 ***
## alco1 -0.217539 0.044819 -4.854 1.21e-06 ***
## active1 -0.238012 0.022943 -10.374 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 86633 on 62498 degrees of freedom
## Residual deviance: 70251 on 62486 degrees of freedom
## AIC: 70277
##
## Number of Fisher Scoring iterations: 4
All the coefficients except gender and the gluc2 level are significant (small p-values). Because gluc3 is significant, gluc is kept as a factor, and only gender is dropped.
##
## Call:
## glm(formula = cardio ~ age + ap_hi + ap_lo + cholesterol + bmi +
## gluc + smoke + alco + active, family = "binomial", data = cardio)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0186 -0.9168 -0.3893 0.9302 2.5593
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.505617 0.141526 -88.362 < 2e-16 ***
## age 0.050446 0.001418 35.585 < 2e-16 ***
## ap_hi 0.062242 0.001046 59.486 < 2e-16 ***
## ap_lo 0.015304 0.001764 8.676 < 2e-16 ***
## cholesterol2 0.360920 0.028895 12.491 < 2e-16 ***
## cholesterol3 1.087543 0.037610 28.916 < 2e-16 ***
## bmi 0.028883 0.002119 13.628 < 2e-16 ***
## gluc2 0.007304 0.038427 0.190 0.849
## gluc3 -0.319628 0.041613 -7.681 1.58e-14 ***
## smoke1 -0.147992 0.035104 -4.216 2.49e-05 ***
## alco1 -0.214920 0.044734 -4.804 1.55e-06 ***
## active1 -0.238103 0.022942 -10.378 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 86633 on 62498 degrees of freedom
## Residual deviance: 70252 on 62487 degrees of freedom
## AIC: 70276
##
## Number of Fisher Scoring iterations: 4
## GVIF Df GVIF^(1/(2*Df))
## gender 1.152808 1 1.073689
## age 1.015821 1 1.007880
## ap_hi 1.748779 1 1.322414
## ap_lo 1.729927 1 1.315267
## cholesterol 1.500081 2 1.106697
## bmi 1.063761 1 1.031388
## gluc 1.483287 2 1.103586
## smoke 1.244471 1 1.115559
## alco 1.139847 1 1.067636
## active 1.002560 1 1.001279
Here we use the generalized variance inflation factor (GVIF) to check whether collinearity is a problem in this logistic regression model. GVIF matters mainly for factors and polynomial terms, which require more than one coefficient and thus more than one degree of freedom. For one-coefficient terms, GVIF equals VIF. We apply the rule GVIF^(1/(2×Df)) < 2, which corresponds to a VIF below 4 for one-coefficient variables. All values in the table are well below 2, so collinearity is not a problem in our logistic regression model, and all retained terms except the gluc2 level are significant (small p-values).
The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.
## Warning in Ops.factor(1, y): '-' not meaningful for factors
The result is shown here:
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: cardio$cardio, fitted(cardiologit)
## X-squared = 62499, df = 8, p-value < 2.2e-16
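The p-value is smaller than 0.05. In the Hosmer and Lemeshow test the null hypothesis is that the model fits well, so a small p-value indicates lack of fit rather than a good fit. However, the warning above shows that the response was passed as a factor, and the reported statistic (62499) equals the sample size, so this result is probably not valid. The test should be rerun with the response converted to numeric 0/1 before drawing a conclusion.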
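The warning `'-' not meaningful for factors` arises because arithmetic on a factor returns `NA`. The base R sketch below reproduces that behaviour on a tiny example and shows the usual conversion to numeric 0/1; it does not call the goodness-of-fit test itself, and whether this fully explains the output above is our inference from the warning.

```r
# A 0/1 outcome stored as a factor, as the modelling code uses for cardio.
y <- factor(c(0, 1, 1, 0))

# Arithmetic on a factor yields NA (with the warning seen in the report).
bad <- suppressWarnings(1 - y)
print(bad)

# Convert via character so the labels "0"/"1" become the numbers 0/1.
# as.numeric(y) alone would return the internal codes 1/2 instead.
y_num <- as.numeric(as.character(y))
print(y_num)
```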
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7874
The area under the ROC curve is 0.7874, slightly below 0.8. The AUC measures how well the model separates patients with and without cardiovascular disease, so by this measure the model discriminates fairly well but not strongly.
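The AUC has a direct interpretation: it is the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it from ranks in base R on four made-up scores; `auc_rank` is a hypothetical helper, not a function from the package used in the report.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r     <- rank(score)          # average ranks, so ties count as half
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.10, 0.40, 0.35, 0.80)  # made-up predicted probabilities
label <- c(0,    0,    1,    1)     # true classes

auc <- auc_rank(score, label)
print(auc)  # 3 of the 4 case/control pairs are ordered correctly: 0.75
```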
## llh llhNull G2 McFadden r2ML
## -3.512576e+04 -4.331635e+04 1.638118e+04 1.890877e-01 2.305682e-01
## r2CU
## 3.074396e-01
The McFadden value of 0.1890877 is analogous to the coefficient of determination $R^2$: the model improves the log-likelihood by about 18.9% over the intercept-only model, which can loosely be read as the share of the variation in cardio explained by the explanatory variables.
Based on these three evaluations, this logistic regression is a reasonably adequate, though not strong, model.
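The McFadden value can be checked by hand from the two log-likelihoods printed in the output above, using the standard definition $1 - \mathrm{llh}/\mathrm{llhNull}$. The variable names below are ours; the inputs are the printed values, so agreement is limited to their seven significant digits.

```r
llh      <- -3.512576e+04  # log-likelihood of the fitted model
llh_null <- -4.331635e+04  # log-likelihood of the intercept-only model

mcfadden <- 1 - llh / llh_null     # McFadden pseudo R-squared
g2       <- -2 * (llh_null - llh)  # likelihood-ratio statistic (G2)

print(c(McFadden = mcfadden, G2 = g2))
```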
When we perform Ridge or Lasso regression, as with most other cases, standardization of variables (z-scores is a typical choice) is very important.
Here we introduce a function uzsale(df, append=0, excl=NULL), which converts all numerical columns to their respective z-scores. Base R can do that too, but this function also handles categorical variables safely and adds some options.
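The helper's source is not shown, so the following is only a sketch of the idea under our own assumptions: standardize numeric columns and leave factors untouched. The name `zscale_numeric` and its `excl` argument are hypothetical stand-ins and are not claimed to match the function used in the report.

```r
# Convert numeric columns to z-scores; skip factors and excluded columns.
zscale_numeric <- function(df, excl = NULL) {
  is_num <- vapply(df, is.numeric, logical(1)) & !(names(df) %in% excl)
  df[is_num] <- lapply(df[is_num], function(col) (col - mean(col)) / sd(col))
  df
}

toy <- data.frame(
  age   = c(40, 50, 60),
  smoke = factor(c(0, 1, 0)),  # categorical: must not be scaled
  bmi   = c(22, 27, 32)
)

scaled <- zscale_numeric(toy)
print(scaled)
```

Standardizing matters for Ridge and Lasso because the penalty acts on coefficient size, and coefficient size depends on each variable's units.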
## Warning: Column `age` has different attributes on LHS and RHS of join
## Warning: Column `ap_hi` has different attributes on LHS and RHS of join
## Warning: Column `ap_lo` has different attributes on LHS and RHS of join
## Warning: Column `bmi` has different attributes on LHS and RHS of join
## [1] 0.02169438
## (Intercept) (Intercept) age ap_hi ap_lo cholesterol2
## -0.07049091 0.00000000 0.25298104 0.72162713 0.19205375 0.31928457
## cholesterol3 gluc2 gluc3 smoke1 alco1 active1
## 0.92053303 0.03557552 -0.18581276 -0.13156551 -0.18546758 -0.20772134
## bmi
## 0.11450625
## (Intercept) age ap_hi ap_lo cholesterol2 cholesterol3
## -0.07049091 0.25298104 0.72162713 0.19205375 0.31928457 0.92053303
## gluc2 gluc3 smoke1 alco1 active1 bmi
## 0.03557552 -0.18581276 -0.13156551 -0.18546758 -0.20772134 0.11450625
The best λ for Ridge regression is almost 0, so the fit is close to the unpenalized model. All the coefficients are retained, which is consistent with the GVIF results in Chapter 3 showing no multicollinearity among the predictors.
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
## [1] 0.0008964143
## (Intercept) (Intercept) age ap_hi ap_lo cholesterol2
## -0.009655563 0.000000000 0.296961528 0.846433715 0.088783375 0.193588280
## cholesterol3 gluc2 gluc3 smoke1 alco1 active1
## 0.759340280 0.000000000 0.000000000 -0.015708352 -0.015541297 -0.109079165
## bmi
## 0.099056559
## (Intercept) age ap_hi ap_lo cholesterol2 cholesterol3
## -0.009655563 0.296961528 0.846433715 0.088783375 0.193588280 0.759340280
## smoke1 alco1 active1 bmi
## -0.015708352 -0.015541297 -0.109079165 0.099056559
The best λ for Lasso regression is also almost 0, which again gives a fit close to the unpenalized model and agrees with Ridge regression. Lasso is very similar to Ridge, but it often forces some coefficients to be exactly zero, which makes it a useful feature selection tool as well. In this case, the two gluc coefficients (gluc2 and gluc3) are shrunk to zero and dropped to avoid overfitting. This partly matches the p-values from the full logistic regression model, where gluc2 was not significant but gluc3 was; gender had already been removed before this step.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 18750
##
##
## | cardio_1NN
## cardio.testLabels | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 5928 | 3619 | 9547 |
## | 0.621 | 0.379 | 0.509 |
## | 0.630 | 0.387 | |
## | 0.316 | 0.193 | |
## ------------------|-----------|-----------|-----------|
## 1 | 3475 | 5728 | 9203 |
## | 0.378 | 0.622 | 0.491 |
## | 0.370 | 0.613 | |
## | 0.185 | 0.305 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 9403 | 9347 | 18750 |
## | 0.501 | 0.499 | |
## ------------------|-----------|-----------|-----------|
##
##
##
##
##
##
## Total Observations in Table: 18750
##
##
## | cardio_3NN
## cardio.testLabels | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 6365 | 3182 | 9547 |
## | 0.667 | 0.333 | 0.509 |
## | 0.660 | 0.349 | |
## | 0.339 | 0.170 | |
## ------------------|-----------|-----------|-----------|
## 1 | 3273 | 5930 | 9203 |
## | 0.356 | 0.644 | 0.491 |
## | 0.340 | 0.651 | |
## | 0.175 | 0.316 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 9638 | 9112 | 18750 |
## | 0.514 | 0.486 | |
## ------------------|-----------|-----------|-----------|
##
##
##
##
##
##
## Total Observations in Table: 18750
##
##
## | cardio_5NN
## cardio.testLabels | 0 | 1 | Row Total |
## ------------------|-----------|-----------|-----------|
## 0 | 6605 | 2942 | 9547 |
## | 0.692 | 0.308 | 0.509 |
## | 0.676 | 0.328 | |
## | 0.352 | 0.157 | |
## ------------------|-----------|-----------|-----------|
## 1 | 3169 | 6034 | 9203 |
## | 0.344 | 0.656 | 0.491 |
## | 0.324 | 0.672 | |
## | 0.169 | 0.322 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 9774 | 8976 | 18750 |
## | 0.521 | 0.479 | |
## ------------------|-----------|-----------|-----------|
##
##
## [1] 0.6216533
## [1] 0.6557333
## [1] 0.67408
The accuracy for k=1 is 62.2%; for k=3, 65.6%; and for k=5, 67.4%.
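Each accuracy is the sum of the diagonal of the corresponding cross table divided by the number of test observations. The sketch below redoes that arithmetic for k=1 using the counts from the first table above; the object names are ours.

```r
# Confusion matrix for k = 1, copied from the first cross table
# (rows = actual class, columns = predicted class).
conf_1nn <- matrix(
  c(5928, 3619,
    3475, 5728),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Accuracy = correctly classified / all test observations.
accuracy_1nn <- sum(diag(conf_1nn)) / sum(conf_1nn)
print(accuracy_1nn)
```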
How does “k” affect classification accuracy? Let’s create a function to calculate classification accuracy based on the number of “k.”
## num [1:2, 1:6] 1 0.622 3 0.656 5 ...
It seems 7-nearest neighbors is an efficient choice because it gives the greatest improvement in predictive accuracy before the incremental improvement trails off. The accuracy for k=7 is 68.5%.
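library(rpart) # recursive partitioning, provides rpart(), printcp(), plotcp()
cardiodtfit <- rpart(cardio ~ age + gender + ap_hi + ap_lo + cholesterol + bmi + gluc + smoke + alco + active, method="class", data=cardio) # fit a classification tree
printcp(cardiodtfit) # display the results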
##
## Classification tree:
## rpart(formula = cardio ~ age + gender + ap_hi + ap_lo + cholesterol +
## bmi + gluc + smoke + alco + active, data = cardio, method = "class")
##
## Variables actually used in tree construction:
## [1] age ap_hi cholesterol
##
## Root node error: 30868/62499 = 0.4939
##
## n= 62499
##
## CP nsplit rel error xerror xstd
## 1 0.40926 0 1.00000 1.00000 0.0040492
## 2 0.01001 1 0.59074 0.59074 0.0036816
## 3 0.01000 3 0.57072 0.58034 0.0036622
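The `rel error` and `xerror` columns are scaled relative to the root node error, so they need to be multiplied by it to obtain misclassification rates. The sketch below does that conversion with the values printed in the table above; the data frame and column names are ours.

```r
root_error <- 30868 / 62499  # misclassification rate with no splits

# Values copied from the complexity parameter table.
cp_table <- data.frame(
  nsplit    = c(0, 1, 3),
  rel_error = c(1.00000, 0.59074, 0.57072),
  xerror    = c(1.00000, 0.59074, 0.58034)
)

# Absolute error rates on the training data and under cross-validation.
cp_table$train_error <- root_error * cp_table$rel_error
cp_table$cv_error    <- root_error * cp_table$xerror

print(round(cp_table, 4))
```

For the three-split tree this gives a cross-validated error of roughly 0.287, that is, an accuracy of about 71%.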
plotcp(cardiodtfit) # visualize cross-validation results
summary(cardiodtfit) # detailed summary of splits
## Call:
## rpart(formula = cardio ~ age + gender + ap_hi + ap_lo + cholesterol +
## bmi + gluc + smoke + alco + active, data = cardio, method = "class")
## n= 62499
##
## CP nsplit rel error xerror xstd
## 1 0.40925878 0 1.0000000 1.0000000 0.004049167
## 2 0.01001037 1 0.5907412 0.5907412 0.003681571
## 3 0.01000000 3 0.5707205 0.5803421 0.003662230
##
## Variable importance
## ap_hi ap_lo age cholesterol bmi gluc
## 49 29 8 8 5 1
##
## Node number 1: 62499 observations, complexity param=0.4092588
## predicted class=0 expected loss=0.4938959 P(node) =1
## class counts: 31631 30868
## probabilities: 0.506 0.494
## left son=2 (37556 obs) right son=3 (24943 obs)
## Primary splits:
## ap_hi < 129.5 to the left, improve=5583.6270, (0 missing)
## ap_lo < 85.5 to the left, improve=3490.2120, (0 missing)
## cholesterol splits as LRR, improve=1281.9080, (0 missing)
## age < 54.5 to the left, improve=1198.3770, (0 missing)
## bmi < 27.43782 to the left, improve= 728.8057, (0 missing)
## Surrogate splits:
## ap_lo < 84.5 to the left, agree=0.834, adj=0.583, (0 split)
## cholesterol splits as LRR, agree=0.645, adj=0.109, (0 split)
## bmi < 29.66726 to the left, agree=0.639, adj=0.095, (0 split)
## age < 61.5 to the left, agree=0.615, adj=0.036, (0 split)
## gluc splits as LRR, agree=0.607, adj=0.016, (0 split)
##
## Node number 2: 37556 observations, complexity param=0.01001037
## predicted class=0 expected loss=0.321653 P(node) =0.6009056
## class counts: 25476 12080
## probabilities: 0.678 0.322
## left son=4 (22650 obs) right son=5 (14906 obs)
## Primary splits:
## age < 54.5 to the left, improve=752.7512, (0 missing)
## cholesterol splits as LLR, improve=588.2769, (0 missing)
## ap_hi < 118.5 to the left, improve=208.8289, (0 missing)
## bmi < 27.8864 to the left, improve=139.2997, (0 missing)
## ap_lo < 77.5 to the left, improve=136.1550, (0 missing)
## Surrogate splits:
## cholesterol splits as LLR, agree=0.618, adj=0.038, (0 split)
## gluc splits as LLR, agree=0.609, adj=0.014, (0 split)
## bmi < 37.80563 to the left, agree=0.604, adj=0.002, (0 split)
## ap_lo < 93.5 to the left, agree=0.603, adj=0.001, (0 split)
##
## Node number 3: 24943 observations
## predicted class=1 expected loss=0.2467626 P(node) =0.3990944
## class counts: 6155 18788
## probabilities: 0.247 0.753
##
## Node number 4: 22650 observations
## predicted class=0 expected loss=0.2404415 P(node) =0.3624058
## class counts: 17204 5446
## probabilities: 0.760 0.240
##
## Node number 5: 14906 observations, complexity param=0.01001037
## predicted class=0 expected loss=0.4450557 P(node) =0.2384998
## class counts: 8272 6634
## probabilities: 0.555 0.445
## left son=10 (13368 obs) right son=11 (1538 obs)
## Primary splits:
## cholesterol splits as LLR, improve=224.52640, (0 missing)
## age < 60.5 to the left, improve=143.42590, (0 missing)
## bmi < 29.27448 to the left, improve= 51.64790, (0 missing)
## ap_hi < 118.5 to the left, improve= 50.34354, (0 missing)
## active splits as RL, improve= 38.03618, (0 missing)
## Surrogate splits:
## gluc splits as LLR, agree=0.919, adj=0.218, (0 split)
##
## Node number 10: 13368 observations
## predicted class=0 expected loss=0.4156194 P(node) =0.2138914
## class counts: 7812 5556
## probabilities: 0.584 0.416
##
## Node number 11: 1538 observations
## predicted class=1 expected loss=0.2990897 P(node) =0.02460839
## class counts: 460 1078
## probabilities: 0.299 0.701
# plot tree
plot(cardiodtfit, uniform=TRUE, main="Classification Tree for cardio")
text(cardiodtfit, use.n=TRUE, all=TRUE, cex=.8)
We also use the caret library to compute the confusion matrix and its summary statistics.
## [1] "Overall: "
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
##   7.181235e-01   4.351948e-01   7.145781e-01   7.216486e-01   5.061041e-01
## AccuracyPValue  McnemarPValue
##   0.000000e+00  1.849779e-239
## [1] "Class: "
##          Sensitivity          Specificity       Pos Pred Value
##            0.7908697            0.6435791            0.6945416
##       Neg Pred Value            Precision               Recall
##            0.7501983            0.6945416            0.7908697
##                   F1           Prevalence       Detection Rate
##            0.7395823            0.5061041            0.4002624
## Detection Prevalence    Balanced Accuracy
##            0.5762972            0.7172244
The overall accuracy is 71.8%.
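These figures can be cross-checked by hand from the leaf counts in the `summary(cardiodtfit)` output. The following is a minimal sketch in base R. The leaf table is typed in from that output, the object names (`leaves`, `cm`) are made up for illustration, and it assumes the confusion matrix was computed on the same 62,499 rows used to fit the tree. The reported sensitivity of 0.79 is reproduced only when class 0 (no disease) is treated as the positive class.

```r
# Leaf nodes of the fitted tree: class counts copied from summary(cardiodtfit)
leaves <- data.frame(
  node      = c(3, 4, 10, 11),
  predicted = c(1, 0, 0, 1),
  n0        = c(6155, 17204, 7812, 460),   # observations with cardio = 0
  n1        = c(18788, 5446, 5556, 1078)   # observations with cardio = 1
)

# 2 x 2 confusion matrix: rows = predicted class, columns = actual class
cm <- rbind(
  "pred 0" = colSums(leaves[leaves$predicted == 0, c("n0", "n1")]),
  "pred 1" = colSums(leaves[leaves$predicted == 1, c("n0", "n1")])
)
colnames(cm) <- c("actual 0", "actual 1")

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["pred 0", "actual 0"] / sum(cm[, "actual 0"])  # class 0 as "positive"
specificity <- cm["pred 1", "actual 1"] / sum(cm[, "actual 1"])

# Cohen's kappa: agreement beyond what the row and column totals alone would give
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa), 4)
```

The four rounded values agree with the Accuracy, Sensitivity, Specificity, and Kappa entries printed above, which suggests the accuracy is a training-set figure rather than a held-out estimate.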
In summary, our analysis indicated that BMI and smoking are associated with the risk of cardiovascular disease. Moreover, age, gender, smoking, blood glucose level, and alcohol use are associated with BMI.
We further grouped the BMI values into the weight classes defined by the NIH. The plot showed that people with higher BMI values tend to have an increased risk of cardiovascular disease. Another way to assess cardiovascular disease risk is waist circumference (WC). Abdominal obesity is a well-researched risk factor for CVD, and WC has been suggested as an adjunct to BMI for determining a person's CVD risk. We estimated WC from BMI using a published formula (Bozeman et al., 2012) and set the parameters for plotting. Our result indicated that WC is also a good variable for predicting cardiovascular disease.
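The grouping of BMI into weight classes can be sketched in base R as follows. This is an illustration rather than the code used in the report: the function name `bmi_class` is hypothetical, it assumes weight in kilograms and height in centimetres, and it uses the standard adult cut-offs of 18.5, 25, and 30 kg/m^2.

```r
# Classify adults into weight classes from weight (kg) and height (cm).
# bmi_class is a hypothetical helper, not a function from the report.
bmi_class <- function(weight_kg, height_cm) {
  bmi <- weight_kg / (height_cm / 100)^2   # BMI = kg / m^2
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right  = FALSE)                      # intervals are [lower, upper)
}

# Made-up example values
bmi_class(weight_kg = c(50, 70, 80, 95),
          height_cm = c(180, 175, 175, 170))
```

With `right = FALSE`, a BMI of exactly 25 falls into "Overweight" and exactly 30 into "Obese", which matches the usual reading of the class boundaries.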
Cardiovascular disease has many risk factors. Studies suggest that genetic variation among patients affects the development of the disease, and a family history of cardiovascular disease also increases a person's risk (Kathiresan & Srivastava, 2012). According to Harvard Health Publishing, the rates of high blood pressure, diabetes, and heart disease vary among people of different races and countries of residence. Therefore, the dataset could include patients' family history, race, and ethnicity as additional variables for analyzing cardiovascular disease.
BMI is positively correlated with cardiovascular disease, and the risk factors that contribute to a high BMI also contribute to the onset of cardiovascular disease. From Chapter 3, we conclude that people who have cardiovascular disease, are female, do not smoke, over-consume alcohol, or are physically inactive tend to have higher BMI values. From Chapters 2-5, we conclude that elderly people are more likely to have higher BMI values, and that people with higher BMI values have a higher risk of cardiovascular disease. In Chapter 6, we listed three methods for predicting cardiovascular disease and found that the best one uses BMI together with waist circumference. Having these two measurements can help people judge their probability of developing the disease from their physical condition.
For model building and evaluation, we used logistic regression to analyze the relationship between CVD and the other risk factors. The regression model indicated that age, blood pressure, and cholesterol level are associated with CVD. It is worth mentioning that a higher cholesterol level is strongly associated with an increased risk of CVD. Furthermore, the VIF evaluation suggested no multicollinearity among the variables. The Hosmer and Lemeshow test returned a very small p-value, which formally signals some lack of fit, although with a sample of this size the test is sensitive to even minor deviations.
In our logistic regression, we wanted to know how each variable influences cardiovascular disease, so we ran the full model including gender, age, systolic blood pressure (ap_hi), diastolic blood pressure (ap_lo), cholesterol level, BMI, glucose level, smoking, alcohol intake, and physical activity. All coefficients are significant (small p-values) except gluc2, possibly because glucose level 2 differs little from the reference level 1, whereas level 3 differs substantially from it. Gender, age, diastolic and systolic blood pressure, cholesterol, and BMI have positive effects on cardiovascular disease (cardio = 1), while smoking, alcohol intake, and physical activity have negative effects. These results are reasonable and confirm our common beliefs.
We evaluated the model with three tools: the Hosmer and Lemeshow test of fit; the receiver operating characteristic (ROC) curve, which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity), together with the area under the curve (AUC); and McFadden's pseudo-\(R^2\), an evaluation tool for logit regressions. First, the Hosmer and Lemeshow p-value is very small. Because the null hypothesis of this test is that the model fits, a small p-value formally indicates some lack of fit; with a sample of this size, however, the test detects even minor deviations, so we weigh it together with the other two measures. Second, the AUC is 0.7874, slightly below 0.8, which indicates reasonable discrimination. Finally, McFadden's value is 0.1890877. It is analogous to the coefficient of determination \(R^2\), so only about 18.9% of the variation in cardio is explained by the explanatory variables in the model. Taken together, the three evaluations suggest that this logistic regression is a relatively acceptable model.
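The McFadden and AUC measures mentioned here can both be computed in base R. The sketch below runs on simulated data, not on the report's dataset: the variable names `ap_hi`, `age`, and `cardio_sim` and the simulation coefficients are assumptions chosen only to produce a plausible binary outcome, so the resulting numbers will not match 0.189 or 0.7874.

```r
set.seed(1)
n <- 2000

# Simulated stand-ins for two predictors (not the report's data)
ap_hi <- rnorm(n, mean = 128, sd = 17)
age   <- rnorm(n, mean = 53,  sd = 7)
p     <- plogis(-11 + 0.06 * ap_hi + 0.06 * age)
cardio_sim <- rbinom(n, size = 1, prob = p)

full <- glm(cardio_sim ~ ap_hi + age, family = binomial)
null <- glm(cardio_sim ~ 1,           family = binomial)

# McFadden's pseudo-R^2: 1 - logLik(full model) / logLik(intercept-only model)
mcfadden <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))

# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen case gets a higher fitted score than a randomly chosen non-case
score <- fitted(full)
r     <- rank(score)
n1    <- sum(cardio_sim == 1)
n0    <- sum(cardio_sim == 0)
auc   <- (sum(r[cardio_sim == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(mcfadden = mcfadden, auc = auc)
```

The two measures answer different questions: McFadden's value compares log-likelihoods, while the AUC only depends on how well the fitted scores rank cases above non-cases, so a model can have a modest pseudo-\(R^2\) and still discriminate reasonably well.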
We further applied Ridge and Lasso regression to guard against overfitting. In the Lasso regression, the coefficient for 'gluc' is additionally dropped from the model we fitted with logistic regression, which reduces overfitting.
Using KNN with k = 7, we could predict the target with an accuracy of 68.5%. Seven nearest neighbors is an efficient choice because it captures the greatest improvement in predictive accuracy before the incremental gains trail off.
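Choosing k by watching where the accuracy gains level off can be sketched as below. This is not the report's code: it uses the built-in `iris` data as a stand-in for the cardio data, `knn()` from the `class` package, and an arbitrary train/test split and grid of odd k values.

```r
library(class)  # provides knn()

set.seed(42)
idx <- sample(nrow(iris), 100)            # 100 training rows, 50 test rows

# Scale the test set with the training set's centre and spread
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

# Test-set accuracy for each candidate k
ks  <- seq(1, 15, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
names(acc) <- ks
acc
```

Plotting `acc` against `ks` gives the kind of curve from which a value such as k = 7 is read off: the smallest k after which further increases no longer improve accuracy noticeably.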
For the decision tree, we ran the full model including gender, age, systolic blood pressure, diastolic blood pressure, cholesterol level, BMI, glucose level, smoking, alcohol intake, and physical activity. First, we visualized the cross-validation results. The complexity parameter (cp) controls the size of the decision tree: it identifies the optimal prunings, so we can prune the tree to avoid overfitting the data. From the graph, a tree of size 4 (three splits) is the best size for our model. A smaller cp value lowers the relative error and increases accuracy on the training data. The final tree uses only three variables (ap_hi, age, and cholesterol) and has a small relative error. Using the caret library, we found an overall accuracy of 71.8%, so it is an acceptable model.
In our fancy plot of the tree, the most important variable is systolic blood pressure, the second is age, and the third is cholesterol. Patients with systolic blood pressure of about 130 or higher make up 40% of the sample and are predicted to have cardiovascular disease (75% of them do). Patients with systolic blood pressure below 130 who are younger than 55 make up 36% of the sample and are predicted not to have cardiovascular disease (76% of them do not). Patients with systolic blood pressure below 130 who are 55 or older and have cholesterol level 1 or 2 make up 21% of the sample and are predicted not to have cardiovascular disease (58% of them do not). Patients with systolic blood pressure below 130 who are 55 or older and have the highest cholesterol level (cholesterol = 3) make up 2% of the sample and are predicted to have cardiovascular disease (70% of them do). The results show that blood pressure, age, and cholesterol level are important indicators of CVD. This is consistent with the logistic regression analysis, which showed that blood pressure and cholesterol level are associated with CVD.
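Selecting cp from the cross-validation table and pruning accordingly can be sketched as follows. The example is an assumption-laden illustration, not the report's workflow: it uses the `kyphosis` data that ships with rpart in place of the cardio data, grows a deliberately large tree with `cp = 0.001`, and picks the cp with the lowest cross-validated error (other rules, such as the one-standard-error rule, are also common).

```r
library(rpart)

set.seed(123)  # cross-validation folds are random

# kyphosis ships with rpart; it stands in for the cardio data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001)

# cptable holds the same columns printcp() shows: CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- cp_table[best_row, "CP"]

# Prune back to the subtree associated with the chosen cp
pruned <- prune(fit, cp = best_cp)

# Number of terminal nodes before and after pruning
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
c(full = n_leaves(fit), pruned = n_leaves(pruned))
```

Applied to the tree in this report, the same idea would read the cp value off the `printcp(cardiodtfit)` table and pass it to `prune()`.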
Overall, this CVD dataset provides useful variables for conducting several chi-square tests, regression analyses, and model building.
Benjamin, E. J., Muntner, P., Alonso, A., Bittencourt, M. S., Callaway, C. W., Carson, A. P., . . . Stroke Statistics, S. (2019). Heart Disease and Stroke Statistics-2019 Update: A Report From the American Heart Association. Circulation, 139(10), e56-e528. doi:10.1161/CIR.0000000000000659
Bozeman, S. R., Hoaglin, D. C., Burton, T. M., Pashos, C. L., Ben-Joseph, R. H., & Hollenbeak, C. S. (2012). Predicting waist circumference from body mass index. BMC Med Res Methodol, 12, 115. doi:10.1186/1471-2288-12-115
World Health Organization. Cardiovascular Disease [Web log post]. Retrieved Oct 12, 2019, from https://www.who.int/health-topics/cardiovascular-diseases/
Kathiresan, S., & Srivastava, D. (2012). Genetics of human cardiovascular disease. Cell, 148(6), 1242-1257. doi:10.1016/j.cell.2012.03.001